calibration guarantee
Robust Decision Making with Partially Calibrated Forecasts
Kiyani, Shayan, Hassani, Hamed, Pappas, George, Roth, Aaron
Calibration has emerged as a foundational goal in ``trustworthy machine learning'', in part because of its strong decision theoretic semantics. Independent of the underlying distribution, and independent of the decision maker's utility function, calibration promises that amongst all policies mapping predictions to actions, the uniformly best policy is the one that ``trusts the predictions'' and acts as if they were correct. But this is true only of \emph{fully calibrated} forecasts, which are tractable to guarantee only for very low dimensional prediction problems. For higher dimensional prediction problems (e.g. when outcomes are multiclass), weaker forms of calibration have been studied that lack these decision theoretic properties. In this paper we study how a conservative decision maker should map predictions endowed with these weaker (``partial'') calibration guarantees to actions, in a way that is robust in a minimax sense: i.e. to maximize their expected utility in the worst case over distributions consistent with the calibration guarantees. We characterize their minimax optimal decision rule via a duality argument, and show that surprisingly, ``trusting the predictions and acting accordingly'' is recovered in this minimax sense by \emph{decision calibration} (and any strictly stronger notion of calibration), a substantially weaker and more tractable condition than full calibration. For calibration guarantees that fall short of decision calibration, the minimax optimal decision rule is still efficiently computable, and we provide an empirical evaluation of a natural one that applies to any regression model solved to optimize squared error.
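The policy that "trusts the predictions" can be illustrated with a minimal sketch: treat the forecast as the true outcome distribution and pick the action with the highest expected utility. The function name `best_response` and the utility table are hypothetical, chosen only for illustration. This is the plain best-response rule, not the paper's minimax rule for partially calibrated forecasts.

```python
from typing import Sequence


def best_response(pred: Sequence[float], utility: Sequence[Sequence[float]]) -> int:
    """Return the index of the action with the highest expected utility.

    pred[y] is the forecast probability of outcome y.
    utility[a][y] is the (assumed, illustrative) utility of action a under outcome y.
    """
    # Expected utility of each action if the forecast is taken at face value.
    expected = [sum(p * u for p, u in zip(pred, row)) for row in utility]
    # Ties are broken in favor of the lowest action index.
    return max(range(len(expected)), key=expected.__getitem__)


# Hypothetical two-action, two-outcome problem:
# action 0 gains 1 under outcome 0 and loses 2 under outcome 1; action 1 is a safe default.
UTILITY = [[1.0, -2.0], [0.0, 0.0]]
confident_action = best_response([0.7, 0.3], UTILITY)
cautious_action = best_response([0.6, 0.4], UTILITY)
```

With these numbers, the risky action is chosen only when the forecast puts enough mass on outcome 0 to offset the larger loss under outcome 1.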
$\beta$-calibration of Language Model Confidence Scores for Generative QA
Manggala, Putra, Mastakouri, Atalanti, Kirschbaum, Elke, Kasiviswanathan, Shiva Prasad, Ramdas, Aaditya
To use generative question-and-answering (QA) systems for decision-making and in any critical application, these systems need to provide well-calibrated confidence scores that reflect the correctness of their answers. Existing calibration methods aim to ensure that the confidence score is on average indicative of the likelihood that the answer is correct. We argue, however, that this standard (average-case) notion of calibration is difficult to interpret for decision-making in generative QA. To address this, we generalize the standard notion of average calibration and introduce $\beta$-calibration, which ensures calibration holds across different question-and-answer groups. We then propose discretized posthoc calibration schemes for achieving $\beta$-calibration.
Calibrated Uncertainty Quantification for Operator Learning via Conformal Prediction
Ma, Ziqi, Azizzadenesheli, Kamyar, Anandkumar, Anima
Operator learning has been increasingly adopted in scientific and engineering applications, many of which require calibrated uncertainty quantification. Since the output of operator learning is a continuous function, quantifying uncertainty simultaneously at all points in the domain is challenging. Current methods consider calibration at a single point or over one scalar function or make strong assumptions such as Gaussianity. We propose a risk-controlling quantile neural operator, a distribution-free, finite-sample functional calibration conformal prediction method. We provide a theoretical calibration guarantee on the coverage rate, defined as the expected percentage of points on the function domain whose true value lies within the predicted uncertainty ball. Empirical results on a 2D Darcy flow and a 3D car surface pressure prediction task validate our theoretical results, demonstrating calibrated coverage and efficient uncertainty bands outperforming baseline methods. In particular, on the 3D problem, our method is the only one that meets the target calibration percentage (percentage of test samples for which the uncertainty estimates are calibrated) of 98%.
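The two quantities named in this abstract, the coverage rate over the function domain and the percentage of test samples that are calibrated, can be sketched as follows. This is only an evaluation sketch under assumptions: the domain is discretized into a finite list of points, the uncertainty ball is a per-point radius, and a sample counts as calibrated when its pointwise coverage reaches a threshold `min_coverage`. The function names are hypothetical and the conformal calibration step itself is not shown.

```python
from typing import Sequence, Tuple

Sample = Tuple[Sequence[float], Sequence[float], Sequence[float]]


def pointwise_coverage(y_true: Sequence[float],
                       y_pred: Sequence[float],
                       radius: Sequence[float]) -> float:
    """Fraction of domain points whose true value lies within the predicted ball."""
    inside = [abs(t - p) <= r for t, p, r in zip(y_true, y_pred, radius)]
    return sum(inside) / len(inside)


def calibrated_fraction(samples: Sequence[Sample], min_coverage: float) -> float:
    """Fraction of test functions whose pointwise coverage is at least min_coverage."""
    hits = [pointwise_coverage(*s) >= min_coverage for s in samples]
    return sum(hits) / len(hits)


# Toy data: two test functions, each evaluated at four domain points.
sample_a = ([1.0, 2.0, 3.0, 4.0], [1.1, 2.5, 3.0, 3.0], [0.2, 0.2, 0.2, 0.2])
sample_b = ([0.0, 0.0, 0.0, 0.0], [0.1, 0.1, 0.1, 0.1], [0.5, 0.5, 0.5, 0.5])
coverage_a = pointwise_coverage(*sample_a)
fraction_ok = calibrated_fraction([sample_a, sample_b], min_coverage=0.75)
```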
Distribution-free calibration guarantees for histogram binning without sample splitting
Gupta, Chirag, Ramdas, Aaditya K.
In classification, the goal is to learn a model that uses observed feature measurements to make a class prediction on the categorical outcome. However, for safety-critical areas such as medicine and finance, a single class prediction might be insufficient and reliable measures of confidence or certainty may be desired. Such uncertainty quantification is often provided by predictors that produce not just a class label, but a probability distribution over the labels. If the predicted probability distribution is consistent with observed empirical frequencies of labels, the predictor is said to be calibrated [Dawid, 1982]. In this paper we study the problem of calibration for binary classification; let $\mathcal{X}$ and $\mathcal{Y} = \{0, 1\}$ denote the feature and label spaces. We focus on the recalibration or post-hoc calibration setting, a standard statistical setting where the goal is to recalibrate existing ('pre-learnt') classifiers that are powerful and (statistically) efficient for classification accuracy, but do not satisfy calibration properties out-of-the-box. This setup is popular for recalibrating pre-trained deep nets. For example, Guo et al. [2017, Figure 4] demonstrated that a pre-learnt ResNet is initially miscalibrated, but can be effectively post-hoc calibrated. In the case of binary classification, the pre-learnt model can be any arbitrary function that provides a classification 'score' $g: \mathcal{X} \to [0, 1]$.
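The post-hoc setting described here can be illustrated with a generic histogram binning sketch: scores from a pre-learnt model are grouped into bins of roughly equal size, and each score is replaced by the empirical label frequency of its bin. The same calibration data is used for both the bin edges and the bin frequencies, with no sample splitting. The function names are hypothetical, and the sketch assumes at least `n_bins` calibration points with distinct scores; it is not presented as the exact procedure analyzed in the paper.

```python
from bisect import bisect_left
from typing import List, Sequence, Tuple


def fit_uniform_mass_binning(scores: Sequence[float],
                             labels: Sequence[int],
                             n_bins: int) -> Tuple[List[float], List[float]]:
    """Return (upper edges of all bins except the last, per-bin label frequency)."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    edges: List[float] = []
    means: List[float] = []
    for b in range(n_bins):
        lo, hi = b * n // n_bins, (b + 1) * n // n_bins
        chunk = pairs[lo:hi]
        # Empirical frequency of label 1 among the points in this bin.
        means.append(sum(y for _, y in chunk) / len(chunk))
        if b < n_bins - 1:
            edges.append(pairs[hi - 1][0])  # largest score in bin b
    return edges, means


def recalibrate(score: float, edges: Sequence[float], means: Sequence[float]) -> float:
    """Map a raw score to the label frequency of the bin it falls into."""
    # A score equal to an edge belongs to the bin that the edge closes.
    return means[bisect_left(edges, score)]


cal_scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
cal_labels = [0, 0, 1, 0, 1, 0, 1, 1]
edges, means = fit_uniform_mass_binning(cal_scores, cal_labels, n_bins=2)
```

After fitting, `recalibrate(g_x, edges, means)` returns one of only `n_bins` distinct values, which is what makes the recalibrated predictor amenable to per-bin analysis.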
Distribution-free binary classification: prediction sets, confidence intervals and calibration
Gupta, Chirag, Podkopaev, Aleksandr, Ramdas, Aaditya
We study three notions of uncertainty quantification---calibration, confidence intervals and prediction sets---for binary classification in the distribution-free setting, that is without making any distributional assumptions on the data. With a focus towards calibration, we establish a 'tripod' of theorems that connect these three notions for score-based classifiers. A direct implication is that distribution-free calibration is only possible, even asymptotically, using a scoring function whose level sets partition the feature space into at most countably many sets. Parametric calibration schemes such as variants of Platt scaling do not satisfy this requirement, while nonparametric schemes based on binning do. To close the loop, we derive distribution-free confidence intervals for binned probabilities for both fixed-width and uniform-mass binning. As a consequence of our 'tripod' theorems, these confidence intervals for binned probabilities lead to distribution-free calibration. We also derive extensions to settings with streaming data and covariate shift.
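A confidence interval for binned probabilities under fixed-width binning can be sketched as below. The interval uses Hoeffding's inequality with a union bound over bins, which is an assumption made for this illustration; the bounds derived in the paper may differ and may be tighter. The function name `fixed_width_bin_intervals` is hypothetical, and scores are assumed to lie in [0, 1].

```python
import math
from typing import List, Sequence, Tuple


def fixed_width_bin_intervals(scores: Sequence[float],
                              labels: Sequence[int],
                              n_bins: int,
                              alpha: float) -> List[Tuple[float, float]]:
    """Per-bin intervals for P(Y = 1 | bin), simultaneously valid at level 1 - alpha.

    Assumption: Hoeffding's inequality plus a union bound over the n_bins bins.
    """
    counts = [0] * n_bins
    positives = [0] * n_bins
    for s, y in zip(scores, labels):
        b = min(int(s * n_bins), n_bins - 1)  # a score of exactly 1.0 goes in the last bin
        counts[b] += 1
        positives[b] += y
    intervals: List[Tuple[float, float]] = []
    for n_b, k_b in zip(counts, positives):
        if n_b == 0:
            intervals.append((0.0, 1.0))  # no data: only the trivial interval is valid
            continue
        mean = k_b / n_b
        half = math.sqrt(math.log(2 * n_bins / alpha) / (2 * n_b))
        intervals.append((max(0.0, mean - half), min(1.0, mean + half)))
    return intervals


toy_scores = [0.25] * 200
toy_labels = [1] * 100 + [0] * 100
toy_intervals = fixed_width_bin_intervals(toy_scores, toy_labels, n_bins=2, alpha=0.05)
```

The interval width shrinks at rate 1/sqrt(n_b) in the bin count n_b, so sparsely populated fixed-width bins get wide intervals; uniform-mass binning avoids this by construction.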